Re-weighting

OpenFisca-UK primarily relies on the Family Resources Survey, which is known to under-capture households at the bottom and top of the income distribution. To correct for this, we apply a weight modification, optimised by gradient descent to minimise survey error against a diverse set of target statistics. These include:

  • Country-level program statistics

  • Regional populations

  • Tax revenues by income range

  • Taxpayer counts by tax band

  • UK population

  • UK population age distribution

  • UK-wide program statistics
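The optimisation itself can be illustrated with a minimal sketch: gradient descent on log-transformed weights (so weights stay non-negative), minimising the sum of squared relative errors against a handful of targets. The incomes, initial weights, and targets below are made up for illustration; the real calibration uses the FRS microdata and the full target list above.

```python
import numpy as np

# Hypothetical microdata: 1,000 households with log-normal incomes
rng = np.random.default_rng(0)
n_households = 1_000
income = rng.lognormal(10, 0.7, n_households)

# Made-up calibration targets: total income and total household count
targets = np.array([2.5e10, 1.1e6])


def predictions(weights):
    # Aggregates implied by a candidate set of weights
    return np.array([(weights * income).sum(), weights.sum()])


# Optimise in log space so the weights remain positive
log_weights = np.log(np.full(n_households, 1_100.0))
learning_rate = 10.0

for step in range(5_000):
    w = np.exp(log_weights)
    rel_error = predictions(w) / targets - 1  # relative error per target
    # Gradient of the sum of squared relative errors w.r.t. log-weights
    grad = (
        2 * rel_error[0] / targets[0] * w * income
        + 2 * rel_error[1] / targets[1] * w
    )
    log_weights -= learning_rate * grad

calibrated = np.exp(log_weights)
```

After training, `predictions(calibrated)` should match both targets closely; the same idea extends to all of the statistics listed above, with some targets held out to compute a validation loss.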

The graph below shows the effect of the optimisation on each of these targets, relative to their starting values under the original FRS weights. All loss subfunctions improve.

import numpy as np
import pandas as pd
import plotly.express as px
import yaml
from openfisca_uk import REPO

# Load the loss history recorded during weight calibration
with open(REPO / "calibration" / "losses.yaml", "r") as f:
    losses = yaml.safe_load(f)

# Individual loss components, excluding the aggregate totals
component_cols = [
    col
    for col in losses
    if col not in ("Total training loss", "Total validation loss")
]

# Reshape each component into 8 blocks of 256 steps and average across blocks
for col in component_cols:
    arr = np.array(losses[col])
    losses[col] = arr[:2048].reshape((8, 256)).mean(axis=0)

df = pd.DataFrame(losses)

# Relative change of each (smoothed) loss metric from its starting value
px.line(df.rolling(10).mean() / df.iloc[0] - 1, y=df.columns).update_layout(
    template="plotly_white",
    legend_title="Loss metric",
    yaxis_tickformat=".0%",
    width=800,
    height=600,
    yaxis_title="Change after training",
    xaxis_title="Training steps",
)

Loss composition by training epoch

px.area(
    # Each component's share of the total component loss at each step
    (df[component_cols].T / df[component_cols].sum(axis=1)).T,
    y=component_cols,
).update_layout(
    template="plotly_white",
    legend_title="Loss metric",
    yaxis_tickformat=".0%",
    width=800,
    height=600,
    yaxis_title="Share of total loss",
    xaxis_title="Training steps",
)

Changes to distributions

from openfisca_uk import Microsimulation

# Household weights before calibration (duplicate_records=2 keeps the record
# count aligned with the calibrated dataset)
original = (
    Microsimulation(adjust_weights=False, duplicate_records=2)
    .calc("household_weight", 2022)
    .values
)
# Household weights after calibration
reweighted = Microsimulation().calc("household_weight", 2022).values

# Relative change to each household's weight
results = pd.DataFrame(
    dict(
        weight=reweighted / original - 1,
    )
)

weights = results.weight

px.histogram(
    x=results.weight, histnorm="probability", nbins=400
).update_layout(
    template="plotly_white",
    legend_title="Source",
    yaxis_tickformat=".1%",
    width=800,
    height=600,
    yaxis_title="Percent of weight values",
    xaxis_title="Relative change to weight",
)
print(
    f"The above histogram shows the distribution of changes to weights. "
    f"Around {(weights <= -1).mean():.0%} of the original dataset is dropped: "
    f"this does not mean that every value provided by those households is "
    f"untrustworthy; rather, the error these households would introduce "
    f"outweighs the benefit of the additional data points they contribute. "
    f"The maximum relative change is a {weights.max():.0f}-fold increase, "
    f"and {(weights > 0).mean():.0%} of households see their weight increase."
)
The above histogram shows the distribution of changes to weights. Around 25% of the original dataset is dropped: this does not mean that every value provided by those households is untrustworthy; rather, the error these households would introduce outweighs the benefit of the additional data points they contribute. The maximum relative change is a 26-fold increase, and 39% of households see their weight increase.
sim = Microsimulation()

# Total number of households represented by the calibrated weights
sim.calc("household_weight", period=2022).values.sum()
28049952.0
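Any weighted aggregate can be sanity-checked against an administrative target in the same way. A minimal sketch with made-up weights, incomes, and target (in practice the arrays would come from `sim.calc(...).values`, and a dropped household simply has weight zero):

```python
import numpy as np

# Hypothetical calibrated weights (the second household has been dropped)
weights = np.array([1_500.0, 0.0, 2_300.0, 900.0])
# Hypothetical per-household incomes
incomes = np.array([25_000.0, 40_000.0, 18_000.0, 60_000.0])

# Aggregate implied by the weighted microdata
aggregate = (weights * incomes).sum()

# Relative error against a made-up administrative target
target = 1.4e8
relative_error = aggregate / target - 1
```

Checks like this, one per target statistic, are what the calibration loss aggregates.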